This notebook is updated with new content from time to time, so don't be surprised if some material is missing or not yet complete.

import numpy as np 
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Resolve the data path, since I work on my local machine and push notebooks to Kaggle
if 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:
    BASE_PATH = '/kaggle/input/playground-series-s5e1/'
else:
    BASE_PATH = 'kaggle/input/playground-series-s5e1/'

Exploratory Data Analysis (EDA)

Exploratory Data Analysis lets us take a first look at the dataset and gain some initial insights.


It also lets us dive deeper into the problem, in our case sticker sales forecasting. Careful, thoughtful analysis can uncover patterns that help improve the model's overall performance.


Our objectives in this notebook:

  • Gain insights from the data.
  • Visualize distributions, relationships and patterns.
  • Identify anomalies such as outliers, missing values, etc.
  • Dive deeper into the time series.
  • Train a model for forecasting sticker sales.

Let's take a look at the data!

Dataset Overview

X_train = pd.read_csv(f'{BASE_PATH}train.csv')
X_test = pd.read_csv(f'{BASE_PATH}test.csv')
print(f'Train dataset contains {X_train.shape[0]} rows and {X_train.shape[1]} columns.')
X_train.head()
Train dataset contains 230130 rows and 6 columns.
id date country store product num_sold
0 0 2010-01-01 Canada Discount Stickers Holographic Goose NaN
1 1 2010-01-01 Canada Discount Stickers Kaggle 973.0
2 2 2010-01-01 Canada Discount Stickers Kaggle Tiers 906.0
3 3 2010-01-01 Canada Discount Stickers Kerneler 423.0
4 4 2010-01-01 Canada Discount Stickers Kerneler Dark Mode 491.0
X_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230130 entries, 0 to 230129
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   id        230130 non-null  int64  
 1   date      230130 non-null  object 
 2   country   230130 non-null  object 
 3   store     230130 non-null  object 
 4   product   230130 non-null  object 
 5   num_sold  221259 non-null  float64
dtypes: float64(1), int64(1), object(4)
memory usage: 10.5+ MB
# Save 'id' column for submission
test_ids = X_test['id']

# Define the target column
target_column = 'num_sold'

# Select categorical and numerical columns (initial)
categorical_columns = X_train.select_dtypes(include=['object']).columns
numerical_columns = X_train.select_dtypes(exclude=['object']).columns

# Print out column information
print("Target Column:", target_column)
print("\nCategorical Columns:", categorical_columns.tolist())
print("\nNumerical Columns:", numerical_columns.tolist())
Target Column: num_sold

Categorical Columns: ['date', 'country', 'store', 'product']

Numerical Columns: ['id', 'num_sold']

A quick note: compared with previous competitions, this dataset contains a small number of features. Most of them are categorical, so later we should try different options to achieve the best performance, such as:

  • Different ways to encode features.
  • Feature engineering from existing features, especially from the date column, which can hold useful patterns (see the sketch below).
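To make that concrete, here is a minimal sketch of both ideas. The exact features and the one-hot encoding are illustrative assumptions, not the final pipeline:

# Hypothetical date-derived features; the final feature set may differ
df = X_train.copy()
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# One simple encoding option among several: one-hot encode the categoricals
df = pd.get_dummies(df, columns=['country', 'store', 'product'])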

Descriptive Statistics

X_train.describe()
id num_sold
count 230130.000000 221259.000000
mean 115064.500000 752.527382
std 66432.953062 690.165445
min 0.000000 5.000000
25% 57532.250000 219.000000
50% 115064.500000 605.000000
75% 172596.750000 1114.000000
max 230129.000000 5939.000000
for column in categorical_columns:
    num_unique = X_train[column].nunique()
    print(f"'{column}' has {num_unique} unique categories.")
'date' has 2557 unique categories.
'country' has 6 unique categories.
'store' has 3 unique categories.
'product' has 5 unique categories.
# Print top 10 unique value counts for each categorical column
for column in categorical_columns:
    print(f"\nTop value counts in '{column}':\n{X_train[column].value_counts().head(10)}")
Top value counts in 'date':
date
2010-01-01    90
2014-09-05    90
2014-08-29    90
2014-08-30    90
2014-08-31    90
2014-09-01    90
2014-09-02    90
2014-09-03    90
2014-09-04    90
2014-09-06    90
Name: count, dtype: int64

Top value counts in 'country':
country
Canada       38355
Finland      38355
Italy        38355
Kenya        38355
Norway       38355
Singapore    38355
Name: count, dtype: int64

Top value counts in 'store':
store
Discount Stickers       76710
Stickers for Less       76710
Premium Sticker Mart    76710
Name: count, dtype: int64

Top value counts in 'product':
product
Holographic Goose     46026
Kaggle                46026
Kaggle Tiers          46026
Kerneler              46026
Kerneler Dark Mode    46026
Name: count, dtype: int64
print("The mean of columns:")
print(X_train[numerical_columns].mean())

print("\nThe std dev of columns:")
print(X_train[numerical_columns].std())

print("\nThe skewness of columns:")
print(X_train[numerical_columns].skew())
The mean of columns:
id          115064.500000
num_sold       752.527382
dtype: float64

The std dev of columns:
id          66432.953062
num_sold      690.165445
dtype: float64

The skewness of columns:
id         -6.374278e-16
num_sold    1.415373e+00
dtype: float64

Data Cleaning Insights

plt.figure(figsize=(15,9))
plt.title("Visualizing Missing Values")
sns.heatmap(X_train.isnull(), cbar=False, cmap=sns.color_palette('magma'), yticklabels=False)
plt.show()

Visual Exploration

filtered_columns = [col for col in categorical_columns if col != 'date']

fig, axes = plt.subplots(len(filtered_columns), 2, figsize=(15, 5 * len(filtered_columns)))

for i, column in enumerate(filtered_columns):
    sns.countplot(data=X_train, x=column, ax=axes[i, 0], palette='tab10')
    axes[i, 0].set_title(f'Distribution of {column}', fontsize=14)
    axes[i, 0].set_xlabel(column, fontsize=12)
    axes[i, 0].set_ylabel('Count', fontsize=12)
    sns.despine(ax=axes[i, 0])

    sns.boxplot(data=X_train, x=column, y=target_column, ax=axes[i, 1], palette='tab10')
    axes[i, 1].set_title(f'{column} vs {target_column}', fontsize=14)
    axes[i, 1].set_xlabel(column, fontsize=12)
    axes[i, 1].set_ylabel(target_column, fontsize=12)
    sns.despine(ax=axes[i, 1])

plt.tight_layout()   
plt.show()
sns.kdeplot(X_train['num_sold'])
<Axes: xlabel='num_sold', ylabel='Density'>

Conclusions

After the EDA we can draw some conclusions about our data (not including the time series for now):

  • Our dataset contains only 6 columns. Our target variable is 'num_sold'.
  • Categorical features are uniformly distributed, meaning we have an equal number of records for each category.
  • The target variable contains missing values, which we should handle carefully later: in a time series context we forecast sales from previous values, so a poorly chosen imputation can skew our results.
  • num_sold contains a high number of outliers across different categories. This suggests that at specific times of the year (e.g. holidays) the number of sales increases.
  • num_sold is positively (right) skewed, so later we may consider applying a transformation to the variable (a quick check follows below).
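As a quick check of that last point, a log transform is one common option for right-skewed counts (np.log1p is just one candidate; nothing here commits to it yet):

# Compare skewness before and after a log1p transform (illustrative only)
print(f"Skewness raw: {X_train['num_sold'].skew():.2f}, "
      f"after log1p: {np.log1p(X_train['num_sold'].dropna()).skew():.2f}")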

Time Series EDA

We've already done EDA for the other features, but I decided to treat the time series analysis as a separate chapter, because EDA for time series differs from EDA for other feature types.

Analysing Missing Values

In this section we'll look at the rows where our target variable is NaN.

missing_num_sold_df = X_train[X_train['num_sold'].isna()]
missing_num_sold_df
id date country store product num_sold
0 0 2010-01-01 Canada Discount Stickers Holographic Goose NaN
45 45 2010-01-01 Kenya Discount Stickers Holographic Goose NaN
90 90 2010-01-02 Canada Discount Stickers Holographic Goose NaN
135 135 2010-01-02 Kenya Discount Stickers Holographic Goose NaN
180 180 2010-01-03 Canada Discount Stickers Holographic Goose NaN
... ... ... ... ... ... ...
229905 229905 2016-12-29 Kenya Discount Stickers Holographic Goose NaN
229950 229950 2016-12-30 Canada Discount Stickers Holographic Goose NaN
229995 229995 2016-12-30 Kenya Discount Stickers Holographic Goose NaN
230040 230040 2016-12-31 Canada Discount Stickers Holographic Goose NaN
230085 230085 2016-12-31 Kenya Discount Stickers Holographic Goose NaN

8871 rows × 6 columns

Hmm, at first glance it seems that all missing values come from the same place. Let's print value_counts for each categorical column.

for c in ['country', 'store', 'product']:
    print(missing_num_sold_df[c].value_counts())
    print()
country
Kenya     4625
Canada    4246
Name: count, dtype: int64

store
Discount Stickers       5179
Stickers for Less       2666
Premium Sticker Mart    1026
Name: count, dtype: int64

product
Holographic Goose     8806
Kerneler                64
Kerneler Dark Mode       1
Name: count, dtype: int64

Indeed! Most of the missing data comes from the Holographic Goose product.

While exploring the competition discussions I've come to the conclusion that the most suitable approach is to drop the rows with missing data. The reason is that time series forecasting depends heavily on previous values; if we imputed the data with the mean, for example, it would skew our results and performance would suffer.

Dropping Missing Values

Typically this isn't part of EDA, but later we will decompose our time series, so we need to drop the missing values first.

X_train = X_train.dropna(axis=0)
X_train.isna().sum()
id          0
date        0
country     0
store       0
product     0
num_sold    0
dtype: int64
X_train['date'] = pd.to_datetime(X_train['date'])

Line Plot of Total Sales Over Time

plt.figure(figsize=(28, 6))
X_train.groupby('date')['num_sold'].sum().plot(title='Total Sales Over Time', xlabel='Date', ylabel='Number of Products Sold')
plt.grid()
plt.show()

Decomposing Time Series

A time series can be described as a combination of a few individual components (a single-pass decomposition is sketched right after this list):

  • Trend: the long-term direction or pattern in the data; it can rise, fall, or stay stable over time.
  • Seasonality: regular, repeating patterns within the data, for example bottled water sales increasing every summer.
  • Cyclic: similar to seasonality, except the patterns don't repeat at fixed intervals.
  • Noise (Residuals): the random variation that remains after the trend and seasonal components are removed.
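As a minimal sketch (assuming weekly seasonality on the daily totals, hence period=7, which is my assumption; the per-year sections below choose their own period):

from statsmodels.tsa.seasonal import seasonal_decompose

# Aggregate to one observation per day, then decompose once
daily_totals = X_train.groupby('date')['num_sold'].sum()
decomposition = seasonal_decompose(daily_totals, model='additive', period=7)  # weekly period: assumption
decomposition.plot()
plt.show()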

Time Series Trend

X_data_index = X_train.set_index('date')
from statsmodels.tsa.seasonal import seasonal_decompose

# Resample to daily totals; only 'num_sold' is needed downstream
daily_data = X_data_index.resample('D')[['num_sold']].sum()

daily_data['year'] = daily_data.index.year
daily_data['month'] = daily_data.index.month
trend_data = []
for year in daily_data['year'].unique():
    yearly_data = daily_data[daily_data['year'] == year]

    # period is measured in observations, so on daily data period=12 models a 12-day cycle
    decomposition = seasonal_decompose(yearly_data['num_sold'], model='additive', period=12)
    
    trend = pd.DataFrame({
        "trend": decomposition.trend,
        "month": yearly_data['month'],
        "year": year
    })
    trend_data.append(trend)

trend_df = pd.concat(trend_data)

plt.figure(figsize=(20, 10))
sns.lineplot(data=trend_df, x='month', y='trend', hue='year', palette='tab10')
plt.title("Trend Components Over Years")
plt.xlabel("Month")
plt.ylabel("Trend Component")
plt.grid(True)
plt.show()

Time Series Seasonal

seasonal_data = []
for year in daily_data['year'].unique():
    yearly_data = daily_data[daily_data['year'] == year]

    decomposition = seasonal_decompose(yearly_data['num_sold'], model='additive', period=12)
    
    seasonal = pd.DataFrame({
        "seasonal": decomposition.seasonal,
        "month": yearly_data['month'],
        "year": year
    })

    seasonal_data.append(seasonal)
    
seasonal_df = pd.concat(seasonal_data)
plt.figure(figsize=(20, 10))
sns.lineplot(data=seasonal_df, x='month', y='seasonal', hue='year', palette='tab10')
plt.title("Seasonal Components Over Years")
plt.xlabel("Month")
plt.ylabel("Seasonal Component")
plt.grid(True)
plt.show()

Time Series Residuals

resid_data = []
for year in daily_data['year'].unique():
    yearly_data = daily_data[daily_data['year'] == year]

    decomposition = seasonal_decompose(yearly_data['num_sold'], model='additive', period=12)
    
    resid = pd.DataFrame({
        "resid": decomposition.resid,
        "month": yearly_data['month'],
        "year": year
    })

    resid_data.append(resid)
    
resid_df = pd.concat(resid_data)
plt.figure(figsize=(20, 10))
sns.lineplot(data=resid_df, x='month', y='resid', hue='year', palette='tab10')
plt.title("Resid Components Over Years")
plt.xlabel("Month")
plt.ylabel("Resid Component")
plt.grid(True)
plt.show()

Stationarity Testing

Let's check whether the series is stationary using the Augmented Dickey–Fuller (ADF) test.

from statsmodels.tsa.stattools import adfuller

def test_stationarity(series, title=''):
    # Null hypothesis of the ADF test: the series has a unit root (is non-stationary)
    print(f"Results of ADF Test on {title}:")
    result = adfuller(series, autolag='AIC')
    print(f"ADF Statistic: {result[0]}")
    print(f"p-value: {result[1]}")
    if result[1] > 0.05:
        print("Non-stationary")
    else:
        print("stationary")
    for key, value in result[4].items():
        print(f"Critical Value ({key}): {value}")
    print("\n")

test_stationarity(X_train['num_sold'], 'Sold')
Results of ADF Test on Sold:
ADF Statistic: -32.10570496433421
p-value: 0.0
stationary
Critical Value (1%): -3.430379566523776
Critical Value (5%): -2.8615530680191887
Critical Value (10%): -2.5667769556355844


It seems that our time series is stationary: the p-value is 0.0, so no further differencing is required before applying models like ARIMA or SARIMA.

Forecasting Time Series

In this section I will try to forecast the time series using statistical methods (ARIMA, SARIMA, etc.) and machine learning methods. It will be interesting to compare the different approaches.
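Before the statistical models, here is a rough sketch of what the machine-learning route could look like. The feature set and the choice of sklearn's HistGradientBoostingRegressor are my assumptions here, not a committed design:

from sklearn.ensemble import HistGradientBoostingRegressor

# Hypothetical minimal pipeline: date parts plus one-hot categoricals
ml_df = pd.get_dummies(
    X_train.assign(year=X_train['date'].dt.year,
                   month=X_train['date'].dt.month,
                   dow=X_train['date'].dt.dayofweek),
    columns=['country', 'store', 'product']
).drop(columns=['id', 'date'])

ml_train = ml_df[X_train['date'] < '2015-01-01']
ml_val = ml_df[X_train['date'] >= '2015-01-01']

model = HistGradientBoostingRegressor(random_state=42)
model.fit(ml_train.drop(columns='num_sold'), ml_train['num_sold'])
preds = model.predict(ml_val.drop(columns='num_sold'))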

Statistical approaches

ARIMA

ARIMA stands for AutoRegressive Integrated Moving Average. The order (p, d, q) combines three parts: AR(p) regresses the series on its own p previous values, I(d) differences the series d times to make it stationary, and MA(q) models the current value using the q previous forecast errors.

Finding ARIMA parameters

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# The PACF cutoff hints at the AR order p; the ACF decay hints at the MA order q
plot_acf(X_train['num_sold'], lags=40)
plot_pacf(X_train['num_sold'], lags=40)
plt.show()
train_end = '2015-01-01'
val_end = '2017-01-01'

# Filter data based on date ranges ('date' is already datetime, so these comparisons work)
train_data = X_train[X_train['date'] < train_end].copy()
val_data = X_train[(X_train['date'] >= train_end) & (X_train['date'] < val_end)].copy()

# Set 'date' as the index for both train and validation data
train_data = train_data.set_index('date')
val_data = val_data.set_index('date')

# Check if the data was correctly processed
print(f"Train data: {train_data.shape}")
print(f"Validation data: {val_data.shape}")
Train data: (157849, 5)
Validation data: (63410, 5)
train_data.isna().sum()
id          0
country     0
store       0
product     0
num_sold    0
dtype: int64
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA

# Fit the ARIMA model on the training dataset
model_train = ARIMA(train_data['num_sold'], order=(1, 1, 1))
model_train_fit = model_train.fit()

# Check for model fit summary
print(model_train_fit.summary())

# Forecast over the validation period
test_forecast = model_train_fit.get_forecast(steps=len(val_data))

# Check if forecast is not NaN
if test_forecast.predicted_mean.isna().any():
    print("Warning: Some forecasted values are NaN")

# NOTE: predicted_mean carries the model's integer RangeIndex (the DatetimeIndex had
# duplicates and no frequency), so aligning it to val_data's index yields NaNs — see below
test_forecast_series = pd.Series(test_forecast.predicted_mean, index=val_data.index)

# Check if test_forecast_series contains NaNs
if test_forecast_series.isna().any():
    print("Warning: test_forecast_series contains NaNs")

# # Calculate the mean squared error
# mse = mean_squared_error(val_data['num_sold'], test_forecast_series)
# rmse = mse**0.5
                               SARIMAX Results                                
==============================================================================
Dep. Variable:               num_sold   No. Observations:               157849
Model:                 ARIMA(1, 1, 1)   Log Likelihood            -1244476.898
Date:                Sun, 05 Jan 2025   AIC                        2488959.797
Time:                        21:26:33   BIC                        2488989.705
Sample:                             0   HQIC                       2488968.691
                             - 157849                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.4506      0.002    188.554      0.000       0.446       0.455
ma.L1         -0.9997   6.22e-05  -1.61e+04      0.000      -1.000      -1.000
sigma2      4.125e+05    821.364    502.238      0.000    4.11e+05    4.14e+05
===================================================================================
Ljung-Box (L1) (Q):                 313.74   Jarque-Bera (JB):            281371.55
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               1.06   Skew:                             1.93
Prob(H) (two-sided):                  0.00   Kurtosis:                         8.29
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Warning: test_forecast_series contains NaNs
test_forecast_series
date
2015-01-01   NaN
2015-01-01   NaN
2015-01-01   NaN
2015-01-01   NaN
2015-01-01   NaN
              ..
2016-12-31   NaN
2016-12-31   NaN
2016-12-31   NaN
2016-12-31   NaN
2016-12-31   NaN
Name: predicted_mean, Length: 63410, dtype: float64
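The all-NaN forecast above comes from an index mismatch: the model was fit against an integer RangeIndex (note "Sample: 0 - 157849" in the summary), so predicted_mean carries integer labels, and rebuilding it against val_data's DatetimeIndex aligns on labels and finds none. A minimal fix, assuming we only want the forecast values placed positionally against the validation index:

# Drop predicted_mean's integer labels and attach the validation index positionally
test_forecast_series = pd.Series(test_forecast.predicted_mean.to_numpy(), index=val_data.index)

# With real numbers in place, the commented-out evaluation above can run
mse = mean_squared_error(val_data['num_sold'], test_forecast_series)
rmse = mse ** 0.5
print(f"Validation RMSE: {rmse:.2f}")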
# Create a plot to compare the forecast with the actual test data
plt.figure(figsize=(14,7))
plt.plot(train_data['num_sold'], label='Training Data')
plt.plot(val_data['num_sold'], label='Actual Data', color='orange')
plt.plot(test_forecast_series, label='Forecasted Data', color='green')
plt.fill_between(val_data.index, 
                 test_forecast.conf_int().iloc[:, 0], 
                 test_forecast.conf_int().iloc[:, 1], 
                 color='k', alpha=.15)
plt.title('ARIMA Model Evaluation')
plt.xlabel('Date')
plt.ylabel('Number of Products Sold')
plt.legend()
plt.show()